Flexing in 73 Languages: A Single Small Model for Multilingual Inflection
Sourada, Tomáš, Straková, Jana
We present a compact, single-model approach to multilingual inflection, the task of generating inflected word forms from base lemmas to express grammatical categories. Our model, trained jointly on data from 73 languages, is lightweight, robust to unseen words, and outperforms monolingual baselines in most languages. This demonstrates the effectiveness of multilingual modeling for inflection and highlights its practical benefits: simplifying deployment by eliminating the need to manage and retrain dozens of separate monolingual models. In addition to the standard SIGMORPHON shared task benchmarks, we evaluate our monolingual and multilingual models on 73 Universal Dependencies (UD) treebanks, extracting lemma-tag-form triples and their frequency counts. To ensure realistic data splits, we introduce a novel frequency-weighted, lemma-disjoint train-dev-test resampling procedure. Our work addresses the lack of an open-source, general-purpose, multilingual morphological inflection system capable of handling unseen words across a wide range of languages, including Czech.
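The frequency-weighted, lemma-disjoint resampling described in the abstract can be sketched as follows. This is a minimal illustration under assumed data shapes (4-tuples carrying frequency counts), not the authors' released code:

```python
import random
from collections import defaultdict

def lemma_disjoint_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (lemma, tag, form, count) records so that no lemma appears in
    more than one split, while each split's share of total token frequency
    approximates the requested ratios."""
    by_lemma = defaultdict(list)
    for rec in records:
        by_lemma[rec[0]].append(rec)
    lemmas = sorted(by_lemma)
    random.Random(seed).shuffle(lemmas)
    total = sum(rec[3] for rec in records)
    budgets = [r * total for r in ratios]
    splits = [[] for _ in ratios]
    masses = [0.0] * len(ratios)
    for lemma in lemmas:
        mass = sum(rec[3] for rec in by_lemma[lemma])
        # greedily place the whole lemma in the split furthest below its
        # frequency budget, so no lemma is shared across splits
        i = max(range(len(ratios)), key=lambda j: budgets[j] - masses[j])
        splits[i].extend(by_lemma[lemma])
        masses[i] += mass
    return splits  # [train, dev, test]
```

Because whole lemmas are assigned at once, the test split contains only unseen lemmas, which matches the paper's goal of realistic OOV evaluation.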
Can a Neural Model Guide Fieldwork? A Case Study on Morphological Data Collection
Mahmudi, Aso, Herce, Borja, Amestica, Demian Inostroza, Scherbakov, Andreas, Hovy, Eduard, Vylomova, Ekaterina
Linguistic fieldwork is an important component of language documentation and preservation. However, it is a long, exhausting, and time-consuming process. This paper presents a novel model that guides a linguist during fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a novel framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving efficiency: (1) increasing the diversity of annotated data by sampling uniformly among the cells of paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.
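Strategy (1), uniform sampling over paradigm cells, can be sketched like this; the data layout is hypothetical, not the paper's code:

```python
import random

def uniform_cell_sample(paradigms, k, seed=0):
    """Pick k (lemma, cell) pairs uniformly at random across all paradigm
    cells, so that rare cells get annotated as often as frequent ones."""
    rng = random.Random(seed)
    cells = [(lemma, cell)
             for lemma, table in sorted(paradigms.items())
             for cell in table]
    return rng.sample(cells, k)  # sampling without replacement
```

Uniform sampling contrasts with frequency-based elicitation, which tends to over-collect common cells and leave paradigm gaps.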
Why do language models perform worse for morphologically complex languages?
Arnett, Catherine, Bergen, Benjamin K.
Language models perform differently across languages. It has been previously suggested that morphological typology may explain some of this variability (Cotterell et al., 2018). We replicate previous analyses and find new evidence for a performance gap between agglutinative and fusional languages, where fusional languages, such as English, tend to have better language modeling performance than morphologically more complex languages like Turkish. We then propose and test three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement. To test the morphological alignment hypothesis, we present MorphScore, a tokenizer evaluation metric, and supporting datasets for 22 languages. We find some evidence that tokenization quality explains the performance gap, but none for the role of morphological alignment. Instead, we find that the performance gap is most reduced when training datasets are of equivalent size across language types, but only when scaled according to the so-called "byte-premium" -- the different encoding efficiencies of different languages and orthographies. These results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology. Differences in performance can be attributed to disparities in dataset size. These results bear on ongoing efforts to improve performance for low-performing and under-resourced languages.
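The "byte premium" referred to above reflects how many bytes a language's orthography needs to encode comparable content. A toy illustration follows; the paper's actual measure is estimated over parallel corpora, not single strings:

```python
def byte_premium(parallel_text, english_text):
    """Ratio of UTF-8 bytes a language uses vs. English for parallel
    content: Latin-script ASCII is 1 byte/char, Cyrillic is 2, many
    Asian scripts are 3, so equal byte budgets buy unequal content."""
    return len(parallel_text.encode("utf-8")) / len(english_text.encode("utf-8"))
```

Equalizing training data by raw bytes therefore undersupplies high-premium languages unless dataset sizes are scaled by this ratio.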
OOVs in the Spotlight: How to Inflect them?
Sourada, Tomáš, Straková, Jana, Rosa, Rudolf
We focus on morphological inflection in out-of-vocabulary (OOV) conditions, an under-researched subtask in which state-of-the-art systems are usually less effective. We developed three systems: a retrograde model and two sequence-to-sequence (seq2seq) models based on LSTM and Transformer. For testing in OOV conditions, we automatically extracted a large dataset of nouns in the morphologically rich Czech language, with lemma-disjoint data splits, and we further manually annotated a real-world OOV dataset of neologisms. In the standard OOV conditions, Transformer achieves the best results, with performance further improved by ensembling with the LSTM, the retrograde model, and SIGMORPHON baselines. On the real-world OOV dataset of neologisms, the retrograde model outperforms all neural models. Finally, our seq2seq models achieve state-of-the-art results in 9 out of 16 languages from SIGMORPHON 2022 shared task data in the OOV evaluation (feature overlap) in the large data condition. We release the Czech OOV Inflection Dataset for rigorous evaluation in OOV conditions. Further, we release the inflection system with the seq2seq models as a ready-to-use Python library.
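A retrograde model of the kind mentioned above can be approximated by indexing suffix-rewrite rules under lemma endings and applying the longest match. The following is a hedged sketch of that idea, not the authors' released library:

```python
from collections import Counter, defaultdict

def learn_rules(triples, max_ctx=4):
    """Collect, for each (lemma ending, tag), the most frequent suffix
    rewrite observed in (lemma, tag, form) training triples."""
    votes = defaultdict(Counter)
    for lemma, tag, form in triples:
        i = 0  # longest common prefix of lemma and form
        while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
            i += 1
        rewrite = (lemma[i:], form[i:])  # e.g. ("", "ed") for walk -> walked
        for n in range(1, max_ctx + 1):
            votes[(lemma[-n:], tag)][rewrite] += 1
    return {key: cnt.most_common(1)[0][0] for key, cnt in votes.items()}

def inflect(lemma, tag, rules, max_ctx=4):
    """Apply the rewrite stored under the longest known lemma ending,
    which generalizes to OOV lemmas sharing that ending."""
    for n in range(max_ctx, 0, -1):
        rule = rules.get((lemma[-n:], tag))
        if rule is not None:
            cut, add = rule
            return lemma[: len(lemma) - len(cut)] + add
    return lemma  # no known ending: fall back to the bare lemma
```

Because inflection in many languages is governed by word endings, this ending-indexed lookup is robust to neologisms, consistent with the abstract's finding that the retrograde model wins on real-world OOV data.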
J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema
Matsuzaki, Kosuke, Taniguchi, Masaya, Inui, Kentaro, Sakaguchi, Keisuke
We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language's agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] +suru (do-PRS)). Morphologically, this form is equivalent to the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We make J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications.
Exploring Linguistic Probes for Morphological Generalization
Kodner, Jordan, Khalifa, Salam, Payne, Sarah
The standard setup, in the SIGMORPHON and SIGMORPHON-UniMorph shared tasks (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020; Pimentel et al., 2021; Kodner et al., 2022) as well as in more targeted studies focused on specific languages or the generalization behavior of computational models (Goldman et al., 2022; Wiemerslage et al., 2022; Kodner et al., 2023b; Guriel et al., 2023; Kodner et al., 2023a), is to train on (lemma, inflection, features) triples and predict inflected forms from held-out (lemma, features) pairs.

Three languages were chosen whose inflectional morphologies range from entirely fusional (English), to mixed (Spanish), to mostly agglutinative (Swahili). In highly agglutinative languages, individual features in a set tend to correspond to distinct morphological patterns, so a model may generalize to unseen feature sets by mapping component features to their corresponding patterns. This is exemplified by the Swahili example (1), in which most features correspond to individual morphemes; only the person/number prefix maps to more than
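The train-on-triples, predict-on-pairs setup described above can be made concrete with a toy data format; the tag strings and helper function here are illustrative, not from the paper:

```python
# UniMorph-style training triples: (lemma, feature bundle, inflected form)
train = [
    ("walk", "V;PST", "walked"),
    ("walk", "V;V.PTCP;PRS", "walking"),
    ("jump", "V;PST", "jumped"),
]
# held-out (lemma, features) pairs: the form must be predicted
heldout = [("jump", "V;V.PTCP;PRS")]

def to_seq2seq_input(lemma, feats):
    """A common seq2seq encoding: feature tags prepended to the lemma's
    characters, so the decoder can condition on both."""
    return feats.split(";") + list(lemma)
```

Under this setup, generalization to a held-out feature bundle requires the model to recombine patterns learned from other (lemma, features) combinations.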
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Cahyawijaya, Samuel, Lovenia, Holy, Aji, Alham Fikri, Winata, Genta Indra, Wilie, Bryan, Mahendra, Rahmad, Wibisono, Christian, Romadhony, Ade, Vincentio, Karissa, Koto, Fajri, Santoso, Jennifer, Moeljadi, David, Wirawan, Cahya, Hudi, Frederikus, Parmonangan, Ivan Halim, Alfina, Ika, Wicaksono, Muhammad Satrio, Putra, Ilham Firdausi, Rahmadani, Samsul, Oenang, Yulianti, Septiandri, Ali Akbar, Jaya, James, Dhole, Kaustubh D., Suryani, Arie Ardiyanti, Putri, Rifki Afina, Su, Dan, Stevens, Keith, Nityasya, Made Nindyatama, Adilazuarda, Muhammad Farid, Ignatius, Ryan, Diandaru, Ryandito, Yu, Tiezheng, Ghifari, Vito, Dai, Wenliang, Xu, Yan, Damapuspita, Dyah, Tho, Cuk, Karo, Ichwanul Muslim Karo, Fatyanosa, Tirana Noor, Ji, Ziwei, Fung, Pascale, Neubig, Graham, Baldwin, Timothy, Ruder, Sebastian, Sujaini, Herry, Sakti, Sakriani, Purwarianti, Ayu
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
Morphological Inflection with Phonological Features
Guriel, David, Goldman, Omer, Tsarfaty, Reut
Recent years have brought great advances in solving morphological tasks, mostly due to powerful neural models applied to tasks such as (re)inflection and analysis. Yet, such morphological tasks cannot be considered solved, especially when little training data is available or when generalizing to previously unseen lemmas. This work explores effects on performance obtained through various ways in which morphological models get access to subcharacter phonological features that are the targets of morphological processes. We design two methods to achieve this goal: one that leaves models as is but manipulates the data to include features instead of characters, and another that manipulates models to take phonological features into account when building representations for phonemes. We elicit phonemic data from standard graphemic data using language-specific grammars for languages with shallow grapheme-to-phoneme mapping, and we experiment with two reinflection models over eight languages. Our results show that our methods yield comparable results to the grapheme-based baseline overall, with minor improvements in some of the languages. All in all, we conclude that patterns in character distributions are likely to allow models to infer the underlying phonological characteristics, even when phonemes are not explicitly represented.
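The data-side method described above (replacing characters with phonological features) can be sketched with a toy feature table; the inventory and feature values here are hypothetical and far smaller than a real phonological grammar:

```python
# toy phoneme -> distinctive-feature bundle (illustrative values only)
FEATURES = {
    "p": ("labial", "stop", "voiceless"),
    "b": ("labial", "stop", "voiced"),
    "a": ("vowel", "low", "back"),
}

def to_feature_sequence(word):
    """Replace each symbol by its feature bundle so the model sees the
    subcharacter targets of phonological processes; unknown symbols
    pass through as singleton bundles."""
    return [FEATURES.get(ch, (ch,)) for ch in word]
```

Because the model itself is untouched, this variant only changes the input/output alphabet, which is what lets the authors compare it directly against a grapheme-based baseline.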
Morphological Inflection: A Reality Check
Kodner, Jordan, Payne, Sarah, Khalifa, Salam, Liu, Zoey
Morphological inflection is a popular task in sub-word NLP with both practical and cognitive applications. For years now, state-of-the-art systems have reported high, but also highly variable, performance across data sets and languages. We investigate the causes of this high performance and high variability; we find several aspects of data set creation and evaluation which systematically inflate performance and obfuscate differences between languages. To improve generalizability and reliability of results, we propose new data sampling and evaluation strategies that better reflect likely use-cases. Using these new strategies, we make new observations on the generalization abilities of current inflection systems.
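One evaluation strategy in the spirit of the abstract is to report accuracy separately for lemmas seen and unseen in training, since seen-lemma overlap can inflate headline scores. A minimal sketch, with hypothetical names:

```python
def seen_unseen_accuracy(predictions, gold, test_lemmas, train_lemmas):
    """Accuracy over test items whose lemma was / was not seen in
    training, reported separately to expose generalization gaps."""
    train_set = set(train_lemmas)
    buckets = {"seen": [0, 0], "unseen": [0, 0]}  # [correct, total]
    for pred, ref, lemma in zip(predictions, gold, test_lemmas):
        b = buckets["seen" if lemma in train_set else "unseen"]
        b[0] += int(pred == ref)
        b[1] += 1
    return {k: (c / t if t else None) for k, (c, t) in buckets.items()}
```

A single aggregate accuracy would average these two regimes together, hiding exactly the variability the paper sets out to diagnose.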
K-UniMorph: Korean Universal Morphology and its Feature Schema
Jo, Eunkyul Leah, Kim, Kyuwon, Wu, Xihan, Lim, KyungTae, Park, Jungyeul, Park, Chulwoo
We present in this work a new Universal Morphology dataset for Korean. Previously, Korean has been underrepresented among the hundreds of diverse world languages covered by morphological paradigm resources. Hence, we propose Universal Morphology paradigms for the Korean language that preserve its distinct characteristics. For our K-UniMorph dataset, we outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata. This dataset adopts the morphological feature schema of Sylak-Glassman et al. (2015) and Sylak-Glassman (2016) for the Korean language, as we extract inflected verb forms from the Sejong morphologically analyzed corpus, one of the largest annotated corpora for Korean. During data creation, our methodology also includes investigating the correctness of the conversion from the Sejong corpus. Furthermore, we carry out the inflection task using three different Korean word forms: letters, syllables, and morphemes. Finally, we discuss and describe future perspectives on Korean morphological paradigms and the dataset.
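The letter-level ("letters", i.e. jamo) word form used in the inflection experiments relies on the fact that precomposed Hangul syllables decompose arithmetically in Unicode; a minimal sketch:

```python
V_COUNT, T_COUNT = 21, 28   # vowel and tail (final consonant) jamo counts
BASE = 0xAC00               # first precomposed syllable, '가'

def to_jamo_indices(syllable):
    """Decompose one Hangul syllable into (lead, vowel, tail) jamo indices
    using the Unicode Hangul composition formula."""
    code = ord(syllable) - BASE
    assert 0 <= code < 19 * V_COUNT * T_COUNT, "not a precomposed syllable"
    return (code // (V_COUNT * T_COUNT),
            (code % (V_COUNT * T_COUNT)) // T_COUNT,
            code % T_COUNT)
```

Working at the jamo level exposes the stem-final consonant alternations that Korean verbal inflection manipulates, which syllable-level units hide.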